
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{{\tt K-ADWIN} = {\tt ADWIN} + Kalman Filtering}
\label{Skadwin}

%In the sequel, whenever we say {\tt ADWIN} we really mean its
%efficient implementation, {\tt ADWIN2}. 

%\subsubsection{The Kalman Filter}
%\label{Sskalman}

One of the most widely used Estimation algorithms is the Kalman filter. We give here a description
of its essentials; see \cite{welch} for a complete introduction.
%The Kalman filter is an optimal recursive data-processing algorithm that generates estimates of the variables 
%(or states) %of the system being controlled by processing all available measurements. 

The Kalman filter addresses the general problem of trying to estimate the state $x \in \Re^n$ 
of a discrete-time controlled process that is governed by the linear stochastic difference equation
$$x_t=Ax_{t-1} + B u_t + w_{t-1}$$
with a measurement $z \in \Re^m$ that is
$$Z_t = H x_t + v_t.$$
%
The random variables $w_t$ and $v_t$ represent the process and measurement noise
(respectively). They are assumed to be independent (of each other), white, and with
normal probability distributions
$$p(w) \sim N(0,Q) $$
$$p(v) \sim N(0,R). $$
%
In essence, the main function of the Kalman filter is to estimate the state vector 
using system sensors and measurement data  corrupted by noise.

The Kalman filter estimates a process by using a form of feedback control: the filter
estimates the process state at some time and then obtains feedback in the form of (noisy)
measurements. As such, the equations for the Kalman filter fall into two groups: time
update equations and measurement update equations. The time update equations are
responsible for projecting forward (in time) the current state and error covariance
estimates to obtain the a priori estimates for the next time step. 
$$x^-_t=A x_{t-1} + B u_t$$
$$P^-_t= AP_{t-1} A^T +Q$$
%
The measurement update equations are responsible for the feedback, i.e. for 
incorporating a new measurement into the a priori estimate to obtain an improved a posteriori estimate.
%
$$K_t=P^-_t H^T(H P^-_tH^T+R)^{-1}$$
$$ x_t=x_t^- + K_t(z_t -Hx_t^-)$$
$$P_t=(I-K_t H) P^-_t.$$
%
There are extensions of the Kalman filter (Extended Kalman Filters, or EKF)
for the cases in which the process to be estimated or the measurement-to-process
relation is nonlinear. We do not discuss them here. 
%Basically, % we change our matrixes $A$ and $H$ for some nonlinear functions $f$ and $h$ A Kalman filter that 
%that linearizes about the current mean and covariance.

In our case we consider the input data sequence of real values $z_1, z_2, \ldots,$ $ z_t, \ldots$ 
as the measurement data. The difference equation of our discrete-time controlled process is the simpler one, 
with $A=1, H=1, B=0$. So the equations are simplified to:
%
$$K_t= P_{t-1}/(P_{t-1}+R)$$
$$X_t=X_{t-1}+ K_t(z_t -X_{t-1})$$
$$P_t=P_t(1-K_t)+Q.$$
%

Note the similarity between this Kalman filter and an EWMA estimator, taking $\alpha = K_t$.
This Kalman filter can be considered as an adaptive EWMA estimator where $\alpha = f(Q,R)$ is calculated
optimally when $Q$ and $R$ are known.

The performance of the Kalman filter depends on the accuracy of the a-priori assumptions:
\begin{itemize}
\item linearity of the difference stochastic equation%, and normal probabilities with zero mean of covariances Q and R.
\item estimation of covariances $Q$ and $R$, assumed to be fixed, known, 
      and follow normal distributions with zero mean.
\end{itemize}  
%
When applying the Kalman filter to data streams that vary arbitrarily over time, both
assumptions are problematic. The linearity assumption for sure, but also the assumption
that parameters $Q$ and $R$ are fixed and known -- in fact, estimating them from the data
is itself a complex estimation problem. 



{\tt ADWIN} is basically a linear Estimator with Change Detector that makes an efficient use of 
Memory. It seems a natural idea to improve its performance by replacing the linear estimator by 
an adaptive Kalman filter, where the parameters $Q$ and $R$ of the Kalman filter are computed
using the information in {\tt ADWIN}'s memory. 

We have set $R=W^2/50$ and $Q=200/W$, where $W$ is the length of the window
maintained by {\tt ADWIN}. While we cannot rigorously prove that these are the optimal choices, 
we have informal arguments that these are about the ``right'' forms for $R$ and $Q$, on 
the basis of the theoretical guarantees of {\tt ADWIN}. 

Let us sketch the argument for $Q$. Theorem \ref{ThBV}, part (2) gives a value $\epsilon$
for the maximum change that may have occurred within the window maintained 
by {\tt ADWIN}. This means that the process variance within that window is at most $\epsilon^2$, 
so we want to set $Q=\epsilon^2$. 
In the formula for $\epsilon$, consider the case in which $n_0 = n_1 = W/2$, then we have
$$
\epsilon \ge 4\cdot \sqrt{{\frac{3 (\mu_{W_0}+\epsilon)}{W/2}} \cdot \ln{\frac{4W}{\delta}} }
$$
Isolating from this equation and distinguishing the extreme cases in which 
$\mu_{W_0} \gg \epsilon$ or $\mu_{W_0} \ll \epsilon$, it can be shown that 
$Q=\epsilon^2$ has a form that varies between $c/W$ and $d/W^2$. Here, $c$ and $d$ are constant
for constant values of $\delta$, and $c=200$ is a reasonable estimation. This justifies
our choice of $Q=200/W$. A similar, slightly more involved argument, 
can be made to justify that reasonable values of $R$ are in the range $W^2/c$ to $W^3/d$, 
for somewhat large constants $c$ and $d$.

When there is no change, {\tt ADWIN} window's length increases, 
so $R$ increases too and $K$ decreases, reducing the significance 
of the most recent data arrived. 
Otherwise, if there is change, {\tt ADWIN} window's length reduces, 
so does $R$, and $K$ increases, which means giving more importance to the last data arrived.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

